Introduction: The diamond market has evolved significantly in recent years. According to the forecast of Kenny and Zimnisky, diamond prices will rise by 5% to 10% in 2024 because supply is lower than in 2023. Russian mining giant Alrosa also halted rough diamond sales for two months, contributing to a boost in prices. In addition, the latest action by the Group of Seven to ban Russian diamond exports is likely to constrain supplies even further and may cut the global diamond supply by up to 30% (Sor, 2024). This situation will shift supply and demand and, in turn, affect the price that consumers pay for diamonds. At the moment, the cost of diamonds is still relatively high, with the price of one carat being $3,913 (StoneAlgo, 2024). On the demand side, the market for engagement rings in Canada is expected to rise from $8.59 billion in 2021 to $11.92 billion in 2030 at a CAGR of 3.7% (Verified Market Research, 2022). This growth in demand not only increases sales but also reinforces the status of diamonds as both a luxury item and an investment. Consumer behavior shows that buyers rely on certain criteria when deciding which diamond to purchase, and companies price diamonds based on these same criteria. For instance, many consumers prioritize sparkle and brilliance, while factors such as depth percentage and width serve as critical metrics for determining diamond quality.

For the reasons detailed above, our analysis will explore a dataset from Kaggle that examines various metrics of diamonds and how they impact price. By analyzing these factors, our team hopes to provide models that can be used to predict the price of diamonds, thereby informing customers, companies, and investors alike. The dataset used in this study is called “Diamond Price Prediction.” It consists of 53,940 rows and 10 columns: carat, cut, color, clarity, total depth percentage (depth), table, price, length (x), width (y), and depth (z). This dataset was chosen because its large number of rows, or samples, should strengthen our analysis. It also provides the information needed to predict the impact of specific diamond characteristics on pricing, allowing us to meet our objective, and its wide array of variables lets us explore the ones that have a linear relationship with price. All of these characteristics make the dataset suitable for EDA, regression, hypothesis testing, and understanding how certain attributes affect the price.

Upon conducting our EDA, we found that depth and width appear to have a linear relationship with price, so our group chose to focus on these two variables. Carat has a well-established relationship with price, making it a potentially redundant variable to explore. Depth is the measurement of a diamond from the table to the culet, and it plays a crucial role in determining the overall appearance and value of a diamond. The width is the diameter of the diamond. Depth and width are both important factors in determining the value of a diamond, as together they determine how light enters and exits the diamond and thus its sparkle and brilliance (Gemological Institute of America, n.d.). Therefore, the primary objective of our project is to analyze the relationship between diamond depth (mm) and width (mm), examining how these factors affect overall price. Additionally, we will identify linear relationships between various diamond characteristics and price, conducting regression analyses to enhance model accuracy and provide customers with clear and intuitive data visualizations.

Research Questions: How can we use the depth and width to predict the price of the diamond? Is there a linear relationship between the width of the diamond and the price? Is there a linear relationship between the depth of the diamond and the price? If a relationship exists for either variable, what direction is it in, and how strong is it? Are all assumptions of the model met? Is there any way to enhance our model?

Hypotheses: As the width of the diamond increases, the price of the diamond will increase too. As the depth of the diamond increases, the price of the diamond will increase too.

Exploratory Data Analysis:

diamond.df = read.csv("/Users/nehaadnan/Desktop/diamonds.csv")
#diamond.df
price_summary <- summary(diamond.df$price)
carat_summary <- summary(diamond.df$carat)
depth_summary <- summary(diamond.df$depth)
table_summary <- summary(diamond.df$table)
x_summary <- summary(diamond.df$x)
y_summary <- summary(diamond.df$y)
z_summary <- summary(diamond.df$z)

diamond_summary <- data.frame(
  Statistic = names(price_summary),
  Price = as.numeric(price_summary),
  Carat = as.numeric(carat_summary),
  Total_Depth_Percentage = as.numeric(depth_summary),
  Table = as.numeric(table_summary),
  Length = as.numeric(x_summary),
  Width = as.numeric(y_summary),
  Depth = as.numeric(z_summary))
diamond_summary 
##   Statistic    Price     Carat Total_Depth_Percentage    Table    Length
## 1      Min.   326.00 0.2000000                43.0000 43.00000  0.000000
## 2   1st Qu.   950.00 0.4000000                61.0000 56.00000  4.710000
## 3    Median  2401.00 0.7000000                61.8000 57.00000  5.700000
## 4      Mean  3932.80 0.7979397                61.7494 57.45718  5.731157
## 5   3rd Qu.  5324.25 1.0400000                62.5000 59.00000  6.540000
## 6      Max. 18823.00 5.0100000                79.0000 95.00000 10.740000
##       Width     Depth
## 1  0.000000  0.000000
## 2  4.720000  2.910000
## 3  5.710000  3.530000
## 4  5.734526  3.538734
## 5  6.540000  4.040000
## 6 58.900000 31.800000
colnames(diamond.df)[colnames(diamond.df) == "depth"] = "T_depth_percentage"
colnames(diamond.df)[colnames(diamond.df) == "x"] = "length"
colnames(diamond.df)[colnames(diamond.df) == "y"] = "width"
colnames(diamond.df)[colnames(diamond.df) == "z"] = "depth"

pairs(price~carat + T_depth_percentage + table + length + width + depth, data=diamond.df) 

colnames(diamond.df)[colnames(diamond.df) == "length"] = "x"
colnames(diamond.df)[colnames(diamond.df) == "width"] = "y"
colnames(diamond.df)[colnames(diamond.df) == "depth"] = "z"
colnames(diamond.df)[colnames(diamond.df) == "T_depth_percentage"] = "depth"

In the process of exploring relationships within our dataset, we created a pairs plot to gain a broad understanding of the correlations between multiple variables. Our analysis revealed that some graphs do not show clear trends. For example, the variable depth does not appear to explain the variation in the variable table, and the reverse relationship also lacks any significant pattern. Additionally, clustering of data is observed between the variables of depth vs. length and table vs. length as well. Based on the initial visualization, there are no significant correlations between these variables that are worth exploring further. However, certain graphs in the pairs plot do exhibit clear trends. For instance, carat and length seem to exhibit a positive correlation regardless of which is treated as the dependent or independent variable. Interestingly, the trend appears to curve for both plots, implying that the relationships are non-linear.

library(ggplot2)  # must be loaded before the first ggplot() call
ggplot(diamond.df,aes(x=price))+
  geom_histogram(binwidth=100,fill="lightblue")+
  labs(title="Distribution of Diamond Prices",x="Price",y="Frequency")

To further our analysis, we created a histogram to visualize the distribution of diamond prices. The histogram revealed that diamond prices are right-skewed, indicating that most diamonds in our dataset are priced at the lower end, with a few diamonds at higher price points.
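The skew can also be checked numerically. The following is a minimal sketch that reuses the price statistics printed in the summary table above; the variable names are our own and are only illustrative.

```r
# Summary values copied from the diamond_summary table above.
price_mean   <- 3932.80
price_median <- 2401.00
price_q1     <- 950.00
price_q3     <- 5324.25

# In a right-skewed distribution the mean is pulled above the median,
# and the upper quartile sits farther from the median than the lower one.
mean_above_median <- price_mean > price_median
upper_gap <- price_q3 - price_median   # 2923.25
lower_gap <- price_median - price_q1   # 1451.00
```

Both checks point in the same direction as the histogram: the mean exceeds the median, and the upper half of the box is about twice as wide as the lower half.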

Since the objective of our project is to gain an understanding of how various characteristics of the diamond influence its price, several scatter plots were created to explore how price was impacted by each of these factors. First, we will explore the impact of quantitative variables on the price, before moving on to explore qualitative variables.

The first plot is a scatter plot of diamond price versus carat. As expected, there seems to be a positive correlation between the two variables, such that as the weight of the diamond increases, the price of the diamond also increases. While this would be a solid relationship to explore, its obviousness leads us to decide not to pursue it further, as we aim to offer new insights beyond this well-established factor.

library(ggplot2)
ggplot(diamond.df, aes(x = carat, y = price)) + geom_point(color = "blue") + labs(title = "Scatterplot of Diamond Price and Carat", x = "Carat (Weight of the Diamond)", y = "Price (in US Dollars)") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 15),axis.title.x = element_text(face = "bold", size = 12),axis.title.y = element_text(face = "bold", size = 12))        

A scatter plot of total depth percentage versus price was also generated, where price is the response variable and total depth percentage is the explanatory variable. The shape of the plot resembles a vertical band, which indicates little to no correlation between these two variables. At each specific depth percentage value, the prices span a wide range, indicating that total depth percentage is likely not a key factor that impacts diamond price.

ggplot(diamond.df, aes(x = depth, y = price)) + geom_point(color = "blue") + labs(title = "Scatterplot of Diamond Price and Total Depth Percentage of Diamond", x = "Total Depth Percentage of Diamond (%)", y = "Price (in US Dollars)") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 12),axis.title.x = element_text(face = "bold", size = 12),axis.title.y = element_text(face = "bold", size = 12))    

The plot below of diamond price versus table displays a similar pattern to the diamond price versus total depth percentage plot. Once again, there is no recognizable trend: the same table value corresponds to a wide range of prices, and different table values can correspond to the same price. Therefore, table does not appear to be a strong factor influencing diamond price.

ggplot(diamond.df, aes(x = table, y = price)) + geom_point(color = "blue") + labs(title = "Scatterplot of Diamond Price and Table", x = "Table (%)", y = "Price (in US Dollars)") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 15),axis.title.x = element_text(face = "bold", size = 12),axis.title.y = element_text(face = "bold", size = 12))

In contrast, the scatter plot of diamond price versus diamond length reveals a strong positive correlation. As diamond length increases, so too does the price of the diamond. This indicates that length might be an influencing factor for price. However, the trend appears to curve upwards, indicating that the relationship is likely non-linear.

ggplot(diamond.df, aes(x = x, y = price)) + geom_point(color = "blue") + labs(title = "Scatterplot of Diamond Price and Length of Diamond", x = "Diamond Length (mm)", y = "Price (in US Dollars)") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 15),axis.title.x = element_text(face = "bold", size = 12),axis.title.y = element_text(face = "bold", size = 12))   

The effect of diamond width on price was also examined. The plot below is indicative of a positive relationship, where increases in diamond width are associated with an increase in price. Despite a few outliers, the main trend of the plot forms a line, suggesting that a linear relationship exists between the two variables. It is also worth noting that the line is very steep, which suggests that even a small increase in width is associated with a large increase in price, while the tight clustering of points around the line points to a strong correlation.

ggplot(diamond.df, aes(x = y, y = price)) + geom_point(color = "blue") + labs(title = "Scatterplot of Diamond Price and Width of Diamond", x = "Diamond Width (mm)", y = "Price (in US Dollars)") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 15),axis.title.x = element_text(face = "bold", size = 12),axis.title.y = element_text(face = "bold", size = 12))   

Lastly, a plot of the depth of the diamond versus diamond price was generated. This plot closely resembles the price versus width plot, displaying increases in price with corresponding increases in the explanatory variable (depth). The positive correlation appears to be linear as well, indicating that depth may be a reliable predictor of price, although some outliers and anomalies were present where depth values were zero.

ggplot(diamond.df, aes(x = z, y = price)) + geom_point(color = "blue") + labs(title = "Scatterplot of Diamond Price and Depth of Diamond", x = "Depth of Diamond (mm)", y = "Price (in US Dollars)") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 15),axis.title.x = element_text(face = "bold", size = 12),axis.title.y = element_text(face = "bold", size = 12))   

To complement the scatter plots created above, we generated a correlation heatmap to visualize the strength of the relationships observed.

library(ggcorrplot)
corr_data = diamond.df[, sapply(diamond.df, function(col) is.numeric(col) || is.integer(col))]
correlation_matrix = cor(corr_data)
ggcorrplot(correlation_matrix, method = "circle", lab = TRUE, lab_size = 3, colors = c("orange", "white", "lightblue"), title = "Correlation Matrix", ggtheme = theme_minimal())

The heat map confirms that depth (Z), width (Y), and length (X) all have strong correlations with price. Carat also exhibits a strong correlation, though, as noted, we chose not to focus on it for originality.

Having examined the impact of quantitative factors on price, we proceed to analyzing the distribution of diamond price with respect to qualitative factors.

diamond.df$cut <- factor(diamond.df$cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")) 

box_cut <- ggplot(diamond.df, aes(x = factor(cut), y = price, fill = factor(cut))) + geom_boxplot() + 
labs(title = "Box Plot of Price Distribution based on Diamond Cut Quality",
x = "Diamond Cut Quality",
y = "Price",
fill = "Diamond Cut Quality") + theme(
    plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),  
    axis.title.x = element_text(face = "bold", size = 12),             
    axis.title.y = element_text(face = "bold", size = 12),             
    legend.title = element_text(face = "bold"))
print(box_cut)

A box plot was created to visualize the distribution of diamond prices based on diamond cut quality, comparing five cut grades ordered from lowest to highest: Fair, Good, Very Good, Premium, and Ideal. The plot shows that the median prices are fairly consistent across the grades, with the exception of the Ideal cut, which has the lowest median price. The Premium cut has the longest box, that is, the largest interquartile range, indicating the widest spread of central values. In contrast, the Fair cut displays a smaller interquartile range, indicating that prices are more consistent within this cut quality.
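The medians and interquartile ranges read off a box plot can also be tabulated directly. The sketch below uses a small made-up data frame (`toy`) rather than the real data, so its numbers say nothing about diamonds; on the real data one would substitute `diamond.df$price` and `diamond.df$cut`.

```r
# Toy data standing in for diamond.df (values are made up for illustration).
toy <- data.frame(
  cut   = factor(c("Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"),
                 levels = c("Fair", "Ideal")),
  price = c(3000, 3200, 3500, 800, 1800, 4600)
)

# Median and interquartile range of price within each cut grade.
cut_median <- tapply(toy$price, toy$cut, median)
cut_iqr    <- tapply(toy$price, toy$cut, IQR)
```

A table like this makes it easier to confirm which grade has the lowest median or the longest box without judging it by eye.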

diamond.df$clarity <- factor(diamond.df$clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
box_clarity <- ggplot(diamond.df, aes(x = factor(clarity), y = price, fill = factor(clarity))) + geom_boxplot() +
labs(title = "Box Plot of Distribution of Diamond Price based on Diamond Clarity",
x = "Diamond Clarity",
y = "Price",
fill = "Diamond Clarity") + theme(
    plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),  
    axis.title.x = element_text(face = "bold", size = 12),             
    axis.title.y = element_text(face = "bold", size = 12),             
    legend.title = element_text(face = "bold")  )    
print(box_clarity)

The effect of clarity on diamond price was also investigated through the creation of a box plot. The clarity of a diamond measures its transparency, with I1 being the lowest grade and IF the best (Gemological Institute of America, n.d.). The box plots for the VS2 and VS1 clarity grades are almost identical and have the longest interquartile ranges, suggesting that middle-grade clarity categories have the widest central price range. Additionally, their long whiskers suggest that there is significant variability in diamond prices for middle-grade clarity categories. In contrast, the IF and VVS1 clarity grades display the tightest distributions of prices, evidenced by their compressed IQRs and shorter whiskers. Despite these observations, there is no consistent trend in prices based on clarity.

Building on the exploratory data analysis presented above and our project’s goal of understanding how various diamond characteristics influence their prices, we have chosen to investigate the impact of depth and width on diamond prices. Both factors appear to have a linear relationship with price, making them ideal for a linear regression analysis. Furthermore, these factors are not as well explored as the impact of the weight of the diamond, allowing us to potentially uncover novel trends.

Regression Analysis 1:

Based on the exploratory data analysis presented above, there appears to be a linear relationship between the width of the diamond (mm) and the price of the diamond ($). To further confirm this relationship, the correlation coefficient will be determined. The correlation coefficient, or r, represents the strength and direction of a linear relationship. Values closer to -1 indicate a strong negative linear relationship, while values closer to 1 indicate a strong positive linear relationship. Values closer to 0 indicate a weak linear relationship, with the direction still given by the sign. Before proceeding with our analysis, we will assess whether the correlation coefficient reflects a strong relationship between diamond width and price.
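To make the definition concrete, the coefficient can be computed from its formula and compared with R's built-in `cor()`. This is a minimal sketch on made-up vectors (`w` and `p` are illustrative names, not columns of the dataset).

```r
# Small made-up vectors; on the real data these would be
# diamond.df$y (width) and diamond.df$price.
w <- c(4.0, 4.5, 5.2, 5.8, 6.5, 7.1)
p <- c(500, 700, 1500, 2600, 4800, 7200)

# Pearson's r: the co-variation of the two variables, scaled by
# the variation of each variable on its own.
r_manual <- sum((w - mean(w)) * (p - mean(p))) /
  sqrt(sum((w - mean(w))^2) * sum((p - mean(p))^2))

r_builtin <- cor(w, p)   # should agree with r_manual
```

Because the scaling removes the units, r always lies between -1 and 1 regardless of whether width is measured in millimeters or price in dollars.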

price = diamond.df$price
width = diamond.df$y
cor(width,price)
## [1] 0.8654209

The correlation coefficient is 0.865. The positive value indicates that the relationship is positive, as observed in the scatter plot. Furthermore, the coefficient’s closeness to 1 indicates that this is a strong linear relationship.

Having established the potential for a linear relationship between the width of the diamond and the price, the analysis will proceed with finding the coefficient estimates of \(B_0\) and \(B_1\).

OLSmodel_1 = lm(diamond.df$price ~ diamond.df$y, data = diamond.df)
print(OLSmodel_1)
## 
## Call:
## lm(formula = diamond.df$price ~ diamond.df$y, data = diamond.df)
## 
## Coefficients:
##  (Intercept)  diamond.df$y  
##       -13402          3023
summary(OLSmodel_1)
## 
## Call:
## lm(formula = diamond.df$price ~ diamond.df$y, data = diamond.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -152436   -1229    -241     838   31436 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13402.027     44.062  -304.2   <2e-16 ***
## diamond.df$y   3022.887      7.536   401.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1999 on 53938 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7489 
## F-statistic: 1.609e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

The estimates for the coefficients are as follows: \(B_0\) = -13402.027 and \(B_1\) = 3022.887. The derived equation can therefore be written as follows: \(\text{Price} = 3022.887 \times \text{width} - 13402.027\). The interpretation of these values is as follows: \(B_0\) represents the intercept, which indicates what the price of the diamond would be if the width were 0. While this scenario is not practically possible, for the sake of interpretation, it means that when the width of the diamond is 0 mm, the price is $-13402.027. The \(B_1\) coefficient, or the slope, tells us how the price changes for each unit change in the width. Since the width is measured in millimeters, this implies that for every one millimeter increase in width, the price will increase by $3022.887. The positive slope reinforces our initial observation that the relationship between the two variables is positive.
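As a worked illustration of the fitted line, the sketch below plugs a width into the equation using the coefficients printed by `summary(OLSmodel_1)`. The helper `predict_price` and the example width of 6 mm are our own choices for illustration.

```r
# Coefficients as printed by summary(OLSmodel_1).
b0 <- -13402.027
b1 <- 3022.887

# Hypothetical helper: fitted price (USD) for a width given in mm.
predict_price <- function(width_mm) b0 + b1 * width_mm

price_at_6mm <- predict_price(6)   # 3022.887 * 6 - 13402.027 = 4735.295

# Width at which the fitted line crosses a price of $0.
zero_price_width <- -b0 / b1       # about 4.43 mm
```

The second quantity highlights a limitation of the straight-line model: for widths below roughly 4.43 mm the fitted price is negative, even though the summary table shows a first-quartile width of 4.72 mm, so some real diamonds fall in that region.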

Based on this, we can also answer our research question: if a relationship exists for either variable, what direction is it in, and how strong is it? From the correlation coefficient, we determined that the relationship between width and price is a strong positive linear one, and the positive slope further supports this conclusion.

Now that we have interpreted our model, we can evaluate it through a visualization.

ggplot(diamond.df, aes(x = y, y = price)) +
  geom_point(shape = 1, size = 1, color = "blue") +
  geom_abline(intercept = -13402, slope = 3023, color = "red", size = 1) +  
  labs( title = "Scatter Plot of Price of Diamond ($) and Width of Diamond (mm)",
    x = "Width of the Diamond (mm)", 
    y = "Price of the Diamond ($)" 
  )  +  
  theme_minimal()

From the plot, we can see that most of the points cluster around the regression line, which suggests a reasonable fit of the model.

Based on this, we can proceed with testing the assumptions for the model. To do this we examine the residuals, which estimate the random error term of our equation. The residuals are calculated by subtracting the predicted values derived from our equation from the observed values in our data set. Essentially, they represent the fit of our model, as smaller residual values indicate a better fit. The assumptions for applying the regression model and hypothesis testing all concern the residuals. The three assumptions are as follows. First, the residuals are independent: we should not see any systematic pattern among the residuals, and the residual for one observation should be independent of the residuals for other observations. Second, the residuals should be normally distributed. Third, the residuals should have equal variance. These results will be used to answer our research question of whether the assumptions for linear regression are met.
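To make the definition of a residual explicit, the following minimal sketch fits a line to a few made-up points and computes the residuals by hand as observed minus fitted. The data frame `toy` is purely illustrative.

```r
# Made-up data for illustration only.
toy <- data.frame(width = c(4, 5, 6, 7, 8),
                  price = c(600, 1500, 3900, 7000, 12000))
fit <- lm(price ~ width, data = toy)

# Residual = observed response minus fitted (predicted) response.
manual_resid <- toy$price - fitted(fit)

# R stores the same quantities in the model object.
model_resid <- residuals(fit)
```

A positive residual means the model under-predicts that observation, and a negative residual means it over-predicts. With an intercept in the model, the residuals sum to zero up to rounding error.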

predicted_value_1 = OLSmodel_1$fitted.values
residual_1 = OLSmodel_1$residuals

1. Residuals independence test:

(To ensure the abnormal data points don’t influence the visualization, we removed them.)

Visualization of residuals versus predicted values:

dia_noabnormal <- diamond.df[-c(49190, 24068,27430,26244,24521,15952,11964,49557,49558), ] 
plot(lm(price ~ y, data = dia_noabnormal), which =1)

The points are distributed around the zero line and the red line. The residuals do not appear to follow any distinctive pattern, which suggests that they are independent.
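The hard-coded row numbers used above can be difficult to audit. One alternative, sketched here on made-up rows, is to select abnormal observations by value. The cutoffs (a dimension of 0 mm, or a width above 20 mm) are our own assumptions for illustration and are not the rule that produced the row list above.

```r
# Made-up rows; x, y, z mimic the length, width, and depth columns of diamond.df.
toy <- data.frame(
  price = c(500, 2100, 5100, 12800, 2075),
  x     = c(4.1, 5.8, 0.0, 8.1, 5.2),
  y     = c(4.2, 5.9, 6.6, 58.9, 5.1),
  z     = c(2.5, 3.6, 0.0, 8.06, 3.2)
)

# Assumed rule: a dimension of 0 mm is a recording error, and a width
# above 20 mm is implausible. Both cutoffs are illustrative.
abnormal  <- toy$x == 0 | toy$y == 0 | toy$z == 0 | toy$y > 20
toy_clean <- toy[!abnormal, ]
```

Filtering by a stated rule keeps the same set of rows out of every diagnostic plot and documents why each row was excluded.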

2. Residuals normally distributed test:

Q-Q plot:

dia_noabnormal <- diamond.df[-c(49190, 24068), ] 
plot(lm(price ~ y, data = dia_noabnormal), which =2)

Based on the Q-Q plot, we can conclude that the residuals are not normally distributed. There is considerable deviation from the reference line in the tails, which is inconsistent with a normal distribution.

3. Residuals equal variance test:

Scale-Location Plot:

dia_noabnormal <- diamond.df[-c(49190, 24068,27430,26244,24521,15952,11964,49557,49558), ] 
plot(lm(price ~ y, data = dia_noabnormal), which =3)

The points are spread around the red line and show no obvious trends or patterns. Based on this, we conclude that the residuals have equal variance.

Therefore, we see that the assumptions for equal variance and independent residuals are met, but the residuals are not normally distributed.

Hypothesis Testing For Width: Having obtained our coefficient estimates and tested the assumptions, we can conduct hypothesis testing to determine the significance of these estimates. This step is essential because our coefficient estimates are based on the available data, or our sample. If the sample were to change, the coefficient estimates might also change. Therefore, we apply our knowledge of hypothesis testing to evaluate whether the actual coefficients (i.e., population coefficients) are significant.

We will start with \(B_1\), or the slope. As stated above, this coefficient represents the change in the dependent variable for a one-unit change in the independent variable. In this case the null hypothesis is \(H_0: B_1 = 0\), or that there is no relationship between the independent variable (the width of the diamond) and the dependent variable (the price of the diamond). Our alternative hypothesis is \(H_A: B_1 > 0\), or that there is a positive relationship between the independent variable (width of diamond) and the dependent variable (price of diamond).

Similarly, we can specify the null and alternative hypotheses for testing the significance of \(B_0\), or the intercept. This coefficient represents the price of the diamond when the width is 0. The null hypothesis is \(H_0: B_0 = 0\), meaning that the true price of diamonds when the width is 0 is 0. The alternative hypothesis is \(H_A: B_0 \neq 0\), indicating that the true price of diamonds is not equal to 0 when the width is 0.

These hypothesis tests will aid us in answering the research questions of whether width is a predictor of diamond price, whether there is a linear relationship between the width of the diamond and its price, and what the nature of that relationship is. These questions will be addressed through the hypothesis test for the slope coefficient, or \(B_1\). If we find evidence against the null hypothesis, this will indicate that changes in width have a positive impact on the price of diamonds, and can thus be considered a predictor of price.

summary(OLSmodel_1)
## 
## Call:
## lm(formula = diamond.df$price ~ diamond.df$y, data = diamond.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -152436   -1229    -241     838   31436 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13402.027     44.062  -304.2   <2e-16 ***
## diamond.df$y   3022.887      7.536   401.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1999 on 53938 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7489 
## F-statistic: 1.609e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
# p-value for $B_1$ (one-sided, upper tail)
1-pt(401.1, OLSmodel_1$df.residual)
## [1] 0

The p-value for the slope is reported as 0. The hypothesis tested was that the slope is 0, or that width has no effect on price. The p-value is the probability of observing a slope coefficient as extreme as or more extreme than the one estimated from our data, given that the null hypothesis is true (there is no relationship). The true value is not exactly zero, but it is too small for R to represent, so it is effectively zero. We therefore conclude that our slope coefficient is significant at the 5% level of significance. In other words, we reject the null hypothesis in favor of the alternative, that there is a positive linear relationship between the width of the diamond and the price. This finding directly answers our research questions: the slope coefficient being significantly greater than 0 indicates a positive linear relationship. Additionally, the \(R^2\) value of 0.749 indicates that approximately 74.9% of the variance in diamond price can be explained by its width, suggesting that width is a useful predictor of price. This also supports our original hypothesis that as the width of the diamond increases, so too will the price.
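The test statistic behind this p-value can be reproduced from the summary table. The sketch below copies the estimate, standard error, and degrees of freedom from the `summary(OLSmodel_1)` output; the variable names are our own.

```r
# Values copied from the summary(OLSmodel_1) output.
estimate  <- 3022.887
std_error <- 7.536
df_resid  <- 53938

# t statistic for H_0: B_1 = 0 is the estimate divided by its standard error.
t_stat <- estimate / std_error   # about 401.1, matching the summary table

# One-sided p-value for H_A: B_1 > 0. Using lower.tail = FALSE asks for
# the upper tail directly instead of computing 1 minus a number near 1.
p_one_sided <- pt(t_stat, df = df_resid, lower.tail = FALSE)
```

With a t statistic this large, the tail probability is smaller than the smallest positive number that double precision can store, which is why R prints 0.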

Similarly, the p-value for the intercept, or \(B_0\), is less than 2e-16. Once again, we tested the hypothesis that the true value of \(B_0\) is equal to 0. This means that there is essentially a zero probability of observing an intercept estimate as extreme as or more extreme than ours, given that the null hypothesis is true (the price of diamonds is 0 when the width is 0), indicating that the null hypothesis is likely false. As a result, we reject the null hypothesis at a significance level of 0.05 in favor of the alternative hypothesis that the true intercept is not equal to 0. In other words, the fitted price of the diamond is non-zero when the width is 0.

However, it is important to note that the assumption for normality was not met, indicating that these results should be interpreted with caution.

Model improvements

While linear regression for Model 1 appeared to provide a good fit, there are potential ways to improve our model. Particularly, if we zoom in on the plot below, we can observe that in the area where the data is most concentrated, the points appear to curve upwards, indicating an exponential trend.

library(ggplot2)
ggplot(diamond.df, aes(x = y, y = price)) +
  geom_point(shape = 1, size = 1, color="blue") +
  labs(
    x = "Width of the Diamond", 
    y = "price" 
  ) + xlim(0, 10) +
  theme_minimal()

In this case, applying a log transformation to the price may be beneficial, as this transformation can help linearize relationships that exhibit exponential growth while also addressing non-normality by compressing the range (Osborne, 2002). We can see how the data changes by applying this transformation below.
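The effect of the transformation can be illustrated on simulated data. In this sketch, `w_sim` and `p_sim` are made-up values generated from an exponential relationship with assumed parameters; they are not the diamonds data.

```r
set.seed(1)  # made-up data, not the diamonds data

# Simulated widths and prices that grow exponentially with width,
# with multiplicative noise (all parameters are assumptions).
w_sim <- runif(200, min = 4, max = 9)
p_sim <- exp(3 + 0.8 * w_sim + rnorm(200, sd = 0.3))

r_raw <- cor(w_sim, p_sim)        # linear correlation on the raw scale
r_log <- cor(w_sim, log(p_sim))   # linear correlation after the log transform
```

Because the logarithm turns the exponential curve into a straight line, the correlation on the log scale is higher than on the raw scale in this simulation.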

diamond.df$log_price = log(diamond.df$price)
ggplot(diamond.df, aes(x = y, y = log_price)) +
  geom_point(shape = 1, size = 1, color="blue") +
  labs(title = "Scatter Plot of Width of Diamond (mm) vs Log Price",
    x = "Width of the Diamond (mm)", 
    y = "Log of Price" 
  ) + xlim(0, 10) +
  theme_minimal()

The resulting plot shows a much stronger linear relationship, suggesting that the transformation may enhance our model. To further assess this improvement, we can compute the correlation coefficient and see whether it has improved as well.

log_price = diamond.df$log_price
cor(width, log_price)
## [1] 0.9361729

From this analysis, we see that the correlation coefficient has indeed increased, from 0.865 to 0.936. Based on this result, we can proceed to fit a linear regression to the width versus the log of the price.

Log price with width of the diamond:

OLSmodel_log1 = lm(log_price ~ y, data = diamond.df)
print(OLSmodel_log1)
## 
## Call:
## lm(formula = log_price ~ y, data = diamond.df)
## 
## Coefficients:
## (Intercept)            y  
##      3.0175       0.8317
summary(OLSmodel_log1)
## 
## Call:
## lm(formula = log_price ~ y, data = diamond.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.593  -0.178   0.001   0.174   6.783 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.017495   0.007863   383.8   <2e-16 ***
## y           0.831677   0.001345   618.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3567 on 53938 degrees of freedom
## Multiple R-squared:  0.8764, Adjusted R-squared:  0.8764 
## F-statistic: 3.825e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

As seen above, our estimate for \(B_1\) is 0.831677 and the estimate for \(B_0\) is 3.017495. This makes our equation: \(\log(\text{price}) = (0.831677 \times \text{width}) + 3.017495\). The interpretation of these coefficients is that for every one mm increase in the width of the diamond, the logarithm of the price increases by 0.831677. Additionally, when the width is 0, the logarithm of the price is 3.017495. Furthermore, both p-values in the summary section are less than 0.05, as they are stated to be <2e-16, indicating that both coefficients are significant (thus, we reject the null hypotheses \(H_0: B_1 = 0\) and \(H_0: B_0 = 0\)).
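Since `log()` in R is the natural logarithm, the coefficients can be translated back to the dollar scale by exponentiating. The sketch below uses the coefficients printed by `summary(OLSmodel_log1)`; the helper `predict_price_log` and the example widths are our own.

```r
# Coefficients as printed by summary(OLSmodel_log1).
b0_log <- 3.017495
b1_log <- 0.831677

# Each extra mm of width multiplies the fitted price by exp(B_1).
multiplier <- exp(b1_log)   # about 2.30, i.e. roughly a 130% increase per mm

# Hypothetical helper: fitted price in USD, back-transformed from the log scale.
predict_price_log <- function(width_mm) exp(b0_log + b1_log * width_mm)

# The ratio of fitted prices one mm apart equals the multiplier.
ratio <- predict_price_log(7) / predict_price_log(6)
```

Unlike the straight-line model, the back-transformed prediction is always positive, because the exponential of any number is greater than zero.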

To further illustrate our findings, we can visualize the data with the line from our equation added:

Visualization:

ggplot(diamond.df, aes(x = y, y = log_price)) +
  geom_point(shape = 1, size = 1, color="blue") +
  geom_abline(intercept = 3.0175, slope = 0.8317, color = "red", size = 1) +
  labs(
    x = "Width of the Diamond", 
    y = "Logarithm of Price" 
  ) + xlim(0, 10) +
  theme_minimal()

Based on the plot above, it is evident that the model is a good fit as the points cluster around the line. We can now move on to test the assumptions of the model.

Residual Analysis

predicted_value_log1 = OLSmodel_log1$fitted.values
residual_log1 = OLSmodel_log1$residuals

1. Residuals independence test:
(To ensure that abnormal data points do not distort the visualization, we removed them.)

Visualization of residuals versus predicted values:

dia_noabnormal <- diamond.df[-c(49190, 24068,27430,26244,24521,15952,11964,49557,49558,27416,15236,17827), ] 
plot(lm(log_price ~ y, data = dia_noabnormal), which =1)

The points are distributed around the zero line and the red line. Additionally, the residuals do not appear to follow any clear pattern, which indicates that the residuals are independent.
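
The rows removed above are listed by index. As one possible way of arriving at such a list, the sketch below flags influential rows with Cook's distance. It runs on simulated data, not the diamond dataset; the names `sim.df`, `fit`, `flagged`, and `keep`, and the 4/n cutoff, are illustrative assumptions and not the procedure used in this report.

```r
# Sketch: flag influential rows with Cook's distance instead of hard-coding indices.
# Simulated data; all object names and the 4/n cutoff are illustrative assumptions.
set.seed(1)
n <- 200
sim.df <- data.frame(y = runif(n, 4, 9))
sim.df$log_price <- 3 + 0.8 * sim.df$y + rnorm(n, sd = 0.3)

# Plant one gross data-entry error in the predictor (row 10)
sim.df$y[10] <- 60

fit <- lm(log_price ~ y, data = sim.df)

# Cook's distance measures how much each row shifts the fitted coefficients
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)          # common rule of thumb, not a formal test
keep <- setdiff(seq_len(n), flagged)  # safe even if nothing is flagged

# Refit without the flagged rows and redraw the residuals-vs-fitted plot
fit_clean <- lm(log_price ~ y, data = sim.df[keep, ])
plot(fit_clean, which = 1)
```

Selecting the rows to keep with setdiff(), instead of negative indexing, avoids dropping every row when no observation exceeds the cutoff.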

2. Residuals normally distributed test:

Q-Q plot:

dia_noabnormal = diamond.df[-c(49190, 24068), ]
plot(lm(log_price ~ y, data = dia_noabnormal), which =2)

The plot above shows that the normality of the residuals has significantly improved. Most of the points lie on the line, with a few minor deviations, so overall the assumption of normality is met.

3. Residuals equal variance test:
Scale-Location Plot:

dia_noabnormal = diamond.df[-c(49190, 24068), ]
plot(lm(log_price ~ y, data = dia_noabnormal), which =3)

The points are distributed around the zero line and the red line and show no obvious trends or patterns. The residuals have equal variance.

The analysis above indicates that applying a log transformation results in a better-fitting model, based on the significant p-values and increased correlation coefficient. Furthermore, all assumptions are now met.

Regression Analysis 2:

For our second analysis, we will explore the impact of diamond depth (mm) on price. Based on the exploratory data analysis (EDA), we determined that the scatter plot between diamond depth and price exhibited a linear relationship. As in the first regression analysis, we will calculate the correlation coefficient to assess the strength and direction of the relationship between diamond depth and price ($).

depth = diamond.df$z
cor(depth,price )
## [1] 0.8612494

The correlation coefficient is found to be 0.8612494, indicating a positive relationship between the depth of the diamond (mm) and its price ($). Additionally, the value being close to 1 suggests that this is a strong linear relationship.

Now that we have confirmed that a linear relationship does indeed exist, we can carry out our regression analysis.

depth = diamond.df$z
OLSmodel_2 = lm(diamond.df$price ~ diamond.df$z, data = diamond.df)
print(OLSmodel_2)
## 
## Call:
## lm(formula = diamond.df$price ~ diamond.df$z, data = diamond.df)
## 
## Coefficients:
##  (Intercept)  diamond.df$z  
##       -13297          4869
summary(OLSmodel_2)
## 
## Call:
## lm(formula = diamond.df$price ~ diamond.df$z, data = diamond.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139561   -1235    -240     825   32085 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13296.57      44.64  -297.9   <2e-16 ***
## diamond.df$z   4868.79      12.37   393.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2027 on 53938 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7417 
## F-statistic: 1.549e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

As seen above, the estimated coefficient for \(B_1\), or the slope, is 4868.79, while the estimated coefficient for \(B_0\), or the intercept, is -13296.57. These results yield the following equation: \(\text{Price} = (4868.79 \times \text{depth}) - 13296.57\). The slope can be interpreted as follows: for every one mm increase in diamond depth, the predicted price increases by 4868.79 dollars. The slope coefficient aligns with our original observations of the scatter plot and correlation coefficient, as it indicates a positive relationship. The intercept coefficient indicates that when the depth of the diamond is 0, the predicted price is -13296.57 dollars. It is important to note that this does not provide useful information, since the depth of a diamond can never be 0 and the price can never be negative.
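
To see where the fitted line stops giving sensible prices, we can solve for the depth at which the predicted price equals zero, using only the two estimates above:

\[
4868.79 \times \text{depth} - 13296.57 = 0 \quad \Longrightarrow \quad \text{depth} = \frac{13296.57}{4868.79} \approx 2.73 \text{ mm}
\]

Any depth below about 2.73 mm therefore receives a negative predicted price, so the linear fit is only interpretable for depths above that value.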

Now that we have our model, we can plot the fitted line against the observed data to see how well it fits.


ggplot(diamond.df, aes(x = z, y = price)) +
  geom_point(shape = 1, size = 1, color = "blue") +
  geom_abline(intercept = -13297, slope = 4869, color = "red", size = 1) +  
  labs(
    x = "Depth of the Diamond", 
    y = "Price" 
  )  +  
  theme_minimal()


The observed points appear to cluster around the line predicted by the linear regression model. This indicates that the fit of the model is good.

Based on the analysis above, we can proceed to test the assumptions of the linear regression model. The residuals should be independent, normally distributed, and have equal variance.

predicted_value_2 = OLSmodel_2$fitted.values
residual_2 = OLSmodel_2$residuals


1. Residuals independence test:
(To ensure that abnormal data points do not distort the visualization, we removed them.)

dia_noabnormal = diamond.df[-c(48411,27740,27113,26244,27504,27430,26124,24521,24395,15952,13602,10164,10168,5472,11964,4792,11183,2208,2315,21655,51507,49558,49557,20695,14636,24068), ] 
plot(lm(price ~ z, data = dia_noabnormal), which =1)

The residuals appear to be independent: the points are distributed around the zero line and the red line. Furthermore, no distinct pattern is seen, which further supports independence.

2. Residuals normally distributed test:
Q-Q plot:

dia_noabnormal = diamond.df[-c(48411), ] 
plot(lm(price ~ z, data = dia_noabnormal), which =2)

The Q-Q plot above shows that while the center of the distribution follows the line, the tails deviate heavily from it. This indicates that the residuals are not normally distributed, so the assumption of normality is violated.

3. Residuals equal variance test:
Scale-Location Plot:

dia_noabnormal = diamond.df[-c(48411), ] 
plot(lm(price ~ z, data = dia_noabnormal), which =3)

The scale-location plot shows that the points are distributed around the zero line and the red line, with no obvious trends or patterns. This indicates that the residuals have equal variance.

Our results for assumption testing are similar to before: we conclude that the assumptions of equal variance and independence are met, but the assumption of normality is violated.

Hypothesis Testing: The hypothesis tests conducted below indicate whether our coefficient estimates are significant. The tests are similar to the ones conducted for the width of the diamond. We want to test whether the true (population) coefficients differ from 0.

We will start with \(B_1\), or the slope. This coefficient represents the change in price for a one-unit change in the depth of the diamond. Our null hypothesis is that there is no relationship between the depth of the diamond and its price, or \(H_0: B_1 = 0\). Our alternative hypothesis is that there is a positive relationship between the depth of the diamond and its price, or \(H_A: B_1 > 0\).

Additionally, a hypothesis test will be conducted to test the significance of \(B_0\), or the intercept. This coefficient represents the price of the diamond when the depth is 0. In this case our null is \(H_0: B_0 = 0\), or that the true price of diamonds when the depth is 0 is 0 dollars. The alternative is \(H_A: B_0 \neq 0\), or that the true price of diamonds is not equal to 0 when the depth is 0.

The hypothesis test for \(B_1\) will be used to answer our remaining research questions: whether depth is a predictor of diamond price, whether there is a linear relationship between the depth of the diamond and its price, and what the nature of that relationship is. Finding evidence against the null hypothesis will indicate that increases in depth are associated with higher prices and that depth is a predictor of price.
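
For reference, each test statistic is the coefficient estimate divided by its standard error, compared against a t distribution with 53938 degrees of freedom. Using the estimates and standard errors reported in the model summary:

\[
t_{B_1} = \frac{4868.79}{12.37} \approx 393.6, \qquad t_{B_0} = \frac{-13296.57}{44.64} \approx -297.9
\]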

summary(OLSmodel_2)
## 
## Call:
## lm(formula = diamond.df$price ~ diamond.df$z, data = diamond.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139561   -1235    -240     825   32085 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13296.57      44.64  -297.9   <2e-16 ***
## diamond.df$z   4868.79      12.37   393.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2027 on 53938 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7417 
## F-statistic: 1.549e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
# One-sided p-value for the slope (upper tail of the t distribution)
pt(393.6, OLSmodel_2$df.residual, lower.tail = FALSE)
## [1] 0

The p-value for the slope is reported as 0, meaning it is smaller than R can represent numerically. The null hypothesis tested was that the slope is 0, or that depth has no effect on price. The result indicates that the probability of observing a slope at least as extreme as the one in our data, given that the null is true (there is no relationship, or the slope is 0), is effectively 0. In other words, it would be extremely unlikely to observe our slope estimate if depth had no effect on price. Therefore, our slope coefficient is significant at the 5% level of significance, and we reject the null in favor of the alternative that there is a positive relationship between the depth of the diamond and the price. Furthermore, we can answer our research questions. The slope is found to be significantly greater than 0, which indicates a positive linear relationship. Additionally, the \(R^2\) value of 0.7418 indicates that 74.18% of the variance in diamond price can be explained by depth, suggesting that depth is a useful predictor of price. Finally, these findings also confirm our original hypothesis that as the depth of the diamond increases, so too will the price, as we have concluded that \(B_1\) is greater than 0.
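
A tail probability this small cannot be stored as an ordinary number, which is why the printed value is exactly 0. The sketch below shows one way to compute the one-sided p-value directly from the upper tail and, as an optional extra, on the log scale. The names `t_slope`, `df_resid`, `p_upper`, and `log_p_upper` are illustrative; the two numeric inputs are copied from the summary output.

```r
# Sketch: one-sided p-value for H_A: B_1 > 0, using the t value and residual
# degrees of freedom printed in the model summary.
t_slope <- 393.6
df_resid <- 53938

# Upper-tail probability computed directly rather than as 1 - pt(...),
# which avoids cancellation when pt(...) is numerically equal to 1
p_upper <- pt(t_slope, df_resid, lower.tail = FALSE)

# The same tail probability on the natural-log scale, which can remain
# representable when p_upper itself underflows
log_p_upper <- pt(t_slope, df_resid, lower.tail = FALSE, log.p = TRUE)

print(p_upper)
print(log_p_upper)
```

Reporting the result as "p < 2e-16", as R's summary does, is usually clearer than reporting a p-value of exactly 0.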

The p-value for the intercept, or \(B_0\), is less than 2e-16. The null hypothesis tested is that the true value of \(B_0\) is equal to 0. This means that the probability of observing an intercept estimate at least as extreme as the one in our data, given that the null is true (the price of diamonds is 0 when the depth is 0), is essentially zero. That is, our estimate would be extremely unlikely if the null were true. We therefore reject the null at a significance level of 0.05 in favor of the alternative that the true intercept is not equal to 0. In other words, the model's predicted price is non-zero when the depth is 0.

However, it is important to note that the assumption of normality was not met, indicating that these results should be interpreted with caution.

Model Improvements:

Our linear regression model for the depth of a diamond and its price appeared to be a good fit; however, the assumption of normality was violated. Therefore, there may be ways to improve our model. Notably, if we zoom in on the area where the data are concentrated, the points appear to follow a curve, indicating an exponential trend.

ggplot(diamond.df, aes(x = z, y = price)) +
  geom_point(shape = 1, size = 1, color="blue") +
  labs(
    x = "Depth of the Diamond", 
    y = "price" 
  ) + xlim(0, 10) +
  theme_minimal()

This case is similar to that of the width of the diamond and the price; therefore, a log transformation may help normalize the data and provide a better-fitting model. We can see how the data change by applying the transformation and then plotting it below:

diamond.df$log_price = log(diamond.df$price)
ggplot(diamond.df, aes(x = z, y = log_price)) +
  geom_point(shape = 1, size = 1, color="blue") +
  labs(
    x = "Depth of the Diamond", 
    y = "Logarithm of Price" 
  ) + xlim(0, 10) +
  theme_minimal()

The scatter plot shows a stronger, more linear relationship between the two variables. This suggests that the log transformation may indeed enhance our model. To further assess whether the transformation results in a stronger linear relationship, the correlation coefficient is calculated below:

cor(depth, log_price)
## [1] 0.9352178

The correlation coefficient between the log of the price and the depth of the diamond is 0.9352178. This value indicates a strong positive linear relationship. Additionally, this value is improved compared to the correlation coefficient between price and depth, which was 0.8612494. Based on both the correlation coefficient and the scatter plot, we can proceed with fitting the relationship to a linear model:

OLSmodel_log2 = lm(log_price ~ z, data = diamond.df)
print(OLSmodel_log2)
## 
## Call:
## lm(formula = log_price ~ z, data = diamond.df)
## 
## Coefficients:
## (Intercept)            z  
##       3.028        1.345
summary(OLSmodel_log2)
## 
## Call:
## lm(formula = log_price ~ z, data = diamond.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.203  -0.183   0.001   0.178   6.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.028409   0.007910   382.9   <2e-16 ***
## z           1.344650   0.002192   613.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3593 on 53938 degrees of freedom
## Multiple R-squared:  0.8746, Adjusted R-squared:  0.8746 
## F-statistic: 3.763e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

The estimate for \(B_1\), or the slope, is 1.344650, while the estimate for the intercept, or \(B_0\), is 3.028409. This makes our new equation: \(\log(\text{price}) = (1.344650 \times \text{depth}) + 3.028409\). The slope indicates that for every 1 mm increase in depth, the log of the predicted price increases by 1.344650. Similarly, the estimate for \(B_0\) indicates that when the depth of the diamond is 0, the log of the predicted price is 3.028409. Furthermore, both p-values in the summary are reported as <2e-16, well below 0.05, indicating that both coefficients are significant. We can reject the null hypotheses (\(B_1 = 0\) and \(B_0 = 0\)) at a significance level of 0.05. Finally, the \(R^2\) value has also improved, from 0.7418 to 0.8746.
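
Because the model is fitted on the log scale, its predictions need to be exponentiated before they can be read in dollars. The sketch below illustrates this on simulated data, not on the diamond dataset; `sim.df`, `fit_log`, `new_depths`, and the simulated coefficients are illustrative assumptions.

```r
# Sketch: turn log-scale predictions into dollar-scale predictions.
# Simulated data; object names and true coefficients are illustrative only.
set.seed(2)
n <- 500
sim.df <- data.frame(z = runif(n, 2.5, 5.5))
sim.df$price <- exp(3 + 1.3 * sim.df$z + rnorm(n, sd = 0.3))
sim.df$log_price <- log(sim.df$price)

fit_log <- lm(log_price ~ z, data = sim.df)

# Predict at three depths (mm), with prediction intervals on the log scale
new_depths <- data.frame(z = c(3, 4, 5))
pred_log <- predict(fit_log, newdata = new_depths, interval = "prediction")

# exp() undoes the natural log; columns fit, lwr, upr are now in price units
pred_price <- exp(pred_log)
print(cbind(new_depths, pred_price))

# Predictions one mm apart differ by the factor exp(slope)
ratio <- pred_price[2, "fit"] / pred_price[1, "fit"]
print(ratio)
```

The back-transformed predictions are always positive, which removes the negative-price problem of the untransformed model.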

To further illustrate the model, we can visualize the observed data with the line from our equation added:

Visualization:

ggplot(diamond.df, aes(x = z, y = log_price)) +
  geom_point(shape = 1, size = 1, color="blue") +
  geom_abline(intercept = 3.028, slope = 1.345, color = "red", size = 1) +
  labs(title = "Depth of the Diamond (mm) vs Log of Price",
    x = "Depth of the Diamond (mm)", 
    y = "Log of Price" 
  ) + xlim(0, 10) +
  theme_minimal()

The plot above indicates that the model is a good fit, as all the points appear to be clustered around the line. Based on our analysis we can proceed with testing the assumptions of normality, independence, and equal variance.

Residual Analysis

predicted_value_log2 = OLSmodel_log2$fitted.values
residual_log2 = OLSmodel_log2$residuals

1. Residuals independence test:
(To ensure that abnormal data points do not distort the visualization, we removed them.)

Visualization of residuals versus predicted values:

dia_noabnormal = diamond.df[-c(48411,27740,27113,26244,27504,27430,26124,24521,24395,15952,13602,10164,10168,5472,11964,4792,11183,2208,2315,21655,51507,49558,49557,20695,14636,24068,27631,27416,34283,49906,49190,23645), ] 
plot(lm(log_price ~ z, data = dia_noabnormal), which =1)

The points are distributed around the zero line and the red line. Additionally, the residuals do not appear to follow a pattern. Both observations indicate that the residuals are independent.

2. Residuals normally distributed test:
Q-Q plot:

dia_noabnormal = diamond.df[-c(48411), ] 
plot(lm(log_price ~ z, data = dia_noabnormal), which =2)

The Q-Q plot shows that the points are distributed around the reference line, and the plot appears improved compared to the original. There are a few outliers and some minor deviations at the tails, but overall the residuals are approximately normally distributed.

3. Residuals equal variance test:
Scale-Location Plot:

dia_noabnormal = diamond.df[-c(48411), ] 
plot(lm(log_price ~ z, data = dia_noabnormal), which =3)

The points are distributed around the zero line and the red line and show no obvious trends or patterns. The residuals have equal variance.

Once again, we found that applying a log transformation to the price greatly improves the model. The assumption of normality is now met, and the correlation coefficient has increased. This indicates that the log transformation results in a better model.
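
As a consistency check, in a regression with a single predictor \(R^2\) equals the square of the correlation coefficient, so the reported correlations and \(R^2\) values can be compared directly:

\[
0.8612494^2 \approx 0.7418, \qquad 0.9352178^2 \approx 0.8746
\]

Both agree with the \(R^2\) values in the two model summaries for depth, before and after the transformation.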

Conclusion:

Our analysis confirmed a positive relationship between both width and depth and the price of diamonds. As the width or depth of a diamond increases, its price tends to rise as well. The original correlation coefficient for width was 0.8654, while the original correlation coefficient for depth was 0.8612, indicating a strong linear relationship between width and price, as well as between depth and price. Additionally, the slopes for both linear regression models were positive and were found to be significant based on hypothesis testing. This suggests that both width and depth are reliable predictors of a diamond’s price, with width showing a slightly stronger linear association with price. Furthermore, the models explained approximately 75% of the variation in price, as indicated by the \(R^2\) value, which demonstrates a good fit. Therefore, our results are in line with our original hypotheses, which predicted that as both the width and depth increase so too will the price.

While the original models met most assumptions, the assumption of normality was violated. To address this, we applied a log transformation to the price variable, which raised the correlation coefficients for width and depth to 0.9362 and 0.9352, respectively. After the transformation, all assumptions of the model were satisfied. Despite the model’s good fit for diamonds with average width and depth, it struggled with extreme values, particularly for diamonds that are very wide, narrow, or deep. These extreme cases did not follow the same patterns as the majority of the data. The presence of outliers affected the model’s performance, as evidenced by the residual patterns and Q-Q plots.

We have successfully answered all our research questions. Moving forward, we recommend using the bootstrap approach to recalculate the beta estimates. By doing so, we could have more accurate significance tests for beta coefficients without relying on normality. In addition to bootstrapping, we recommend exploring multivariate models to improve the model’s ability to handle extreme values. This approach involves incorporating additional factors that may affect diamond prices, such as carat, clarity, or cut. By including these additional predictors, the model would capture more variation in price, particularly for diamonds that fall outside the average range, leading to improved accuracy and better predictions for extreme cases. Furthermore, we suggest applying additional adjustments to better handle outliers. Using non-linear models or robust regression techniques may help mitigate the impact of outliers and enhance the overall fit of the model for diamonds that do not conform to standard patterns. Implementing these recommended adjustments will enhance the accuracy and predictive power of future models, providing more reliable estimates across various price ranges and diamond characteristics. This improvement will lead to better insights into diamond valuation, benefiting both buyers and sellers in the market.
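
As an illustration of the bootstrap recommendation, the sketch below resamples rows with replacement and refits the regression each time, giving a percentile interval for the slope. It uses simulated data with skewed errors, not the diamond dataset; `sim.df`, `B`, `boot_slopes`, and the simulated coefficients are illustrative assumptions, and this is only one of several bootstrap schemes.

```r
# Sketch: case-resampling bootstrap for a regression slope (base R only).
# Simulated data with skewed, mean-zero errors; all names are illustrative.
set.seed(3)
n <- 300
sim.df <- data.frame(z = runif(n, 2.5, 5.5))
sim.df$price <- -13000 + 4800 * sim.df$z + (rexp(n, rate = 1 / 1500) - 1500)

B <- 1000                     # number of bootstrap resamples
boot_slopes <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample.int(n, size = n, replace = TRUE)  # resample rows with replacement
  boot_fit <- lm(price ~ z, data = sim.df[idx, ])
  boot_slopes[b] <- coef(boot_fit)[["z"]]
}

# Percentile interval: does not rely on normally distributed residuals
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
print(ci)
```

If the interval excludes 0, the conclusion about the slope does not depend on the normality assumption that the original model violated.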

References:

Gemological Institute of America (GIA). (n.d.). The 4Cs of Diamond Quality. Retrieved from GIA

Gem Society. (n.d.). Understanding Diamond Measurements: Depth and Table. Retrieved from Gem Society

Gemological Institute of America (GIA). (n.d.). GIA 4Cs clarity. https://www.gia.edu/gia-about/4cs-clarity#:~:text=The%20GIA%20Clarity%20Scale%20contains,visible%20under%2010%C3%97%20magnification.

Osborne, J. (2002). Notes on the use of data transformations. Practical Assessment, Research, and Evaluation, 8(1), 6. https://doi.org/10.7275/4vng-5608

StoneAlgo. (2024, October 11). Market data and performance for 1 carat natural diamond prices. https://www.stonealgo.com/diamond-prices/1-carat-diamond-prices/

Simpson, G. (2024, April 3). Diamond Depth and Table: Understanding Their Impact on Brilliance and Sparkle. Diamondrensu. https://diamondrensu.com/blogs/education/diamond-depth-and-table?srsltid=AfmBOopb31eAfsfzDd8NjUHTAdd0qcCgrp6g4eNbXJZ5yzlqZko_NJxr

Sor, J. (2024, January 21). Why diamonds are about to get a lot more expensive. Business Insider. https://markets.businessinsider.com/news/commodities/diamond-prices-engagement-ring-carat-value-outlook-recession-mining-supply-2024-1

Verified Market Research. (2022, September 9). Canada diamond engagement ring market size, share & forecast. https://www.verifiedmarketresearch.com/product/canada-diamond-engagement-ring-market/

Agrawal, S. (2017, May 25). Diamonds. Kaggle. https://www.kaggle.com/datasets/shivam2503/diamonds